Cross-validation experiments with networks

pathpy provides basic support for evaluations based on cross-validation experiments. In particular, the train_test_split method can be used to create train and test splits. The semantics of the method and of its arguments are similar to those of the corresponding function in `sklearn <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>`__.

To demonstrate the use, we generate a random graph:

import pathpy as pp

n = pp.generators.ER_np(100, 0.04)
print(n)
n.plot()
Uid:                        0x7f8c70080d30
Type:                       Network
Directed:           False
Multi-Edges:                False
Number of nodes:    100
Number of edges:    215

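To generate a test and a train network instance, where the test network contains a random 25 % of the nodes, we can write the following. Note that, as used here, the function returns the test network first and the training network second, which is the reverse of the order returned by the sklearn function.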

test, train = pp.algorithms.evaluation.train_test_split(n, test_size=0.25)
print(test)
print(train)
Uid:                        0x7f8c70080d30_test
Type:                       Network
Directed:           False
Multi-Edges:                False
Number of nodes:    25
Number of edges:    19
Uid:                        0x7f8c70080d30_train
Type:                       Network
Directed:           False
Multi-Edges:                False
Number of nodes:    75
Number of edges:    124

The method generates two new Network instances that refer to the same node and edge objects as the original network, so the split adds little memory overhead. The original network instance is not changed. The uids of the newly generated networks are set to the original uid with the suffix _test or _train, respectively.
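
As a rough illustration of why sharing objects keeps the memory cost low, the following plain-Python sketch builds two containers that reference the same objects rather than copying them. The Node class here is a hypothetical stand-in and not pathpy's own class; the sketch says nothing about how pathpy stores its nodes internally.

```python
class Node:
    """Hypothetical stand-in for a node object (not pathpy's class)."""

    def __init__(self, uid):
        self.uid = uid


# "Original network": a list holding four node objects.
original = [Node(f"n{i}") for i in range(4)]

# Slicing copies the references, not the Node objects themselves.
test_view = original[:1]
train_view = original[1:]

# The views point at the very same objects as the original container.
shared = train_view[0] is original[1]
print(shared)  # True
```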

By default, the split is made on the nodes, and the train and test networks include the edges incident to their respective node sets. Edges whose endpoints fall on different sides of the split are therefore lost: in the example above, the two networks together contain only 19 + 124 = 143 of the original 215 edges. To preserve the number of edges, we can set the split argument to 'edge'. This samples a random fraction of the edges and adds all nodes to both networks, i.e. the node sets of the two networks are identical. The numbers of edges in the training and test networks then sum to the number of edges in the original network.

test, train = pp.algorithms.evaluation.train_test_split(n, test_size=0.25, split='edge')
print(test)
print(train)
Uid:                        0x7f8c70080d30_test
Type:                       Network
Directed:           False
Multi-Edges:                False
Number of nodes:    100
Number of edges:    53
Uid:                        0x7f8c70080d30_train
Type:                       Network
Directed:           False
Multi-Edges:                False
Number of nodes:    100
Number of edges:    162
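
The difference between the node-based and the edge-based split can be sketched in plain Python, independently of pathpy. This is a minimal illustration and not the library's implementation: the helper names node_split and edge_split are hypothetical, and the sketch assumes that a node-based split keeps only edges whose two endpoints lie on the same side.

```python
import random


def node_split(nodes, edges, test_size=0.25, seed=0):
    """Split the nodes; each side keeps only edges with both endpoints on it."""
    rng = random.Random(seed)
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    n_test = round(test_size * len(shuffled))
    test_nodes = set(shuffled[:n_test])
    train_nodes = set(shuffled[n_test:])
    test_edges = [(u, v) for u, v in edges
                  if u in test_nodes and v in test_nodes]
    train_edges = [(u, v) for u, v in edges
                   if u in train_nodes and v in train_nodes]
    return (test_nodes, test_edges), (train_nodes, train_edges)


def edge_split(nodes, edges, test_size=0.25, seed=0):
    """Split the edges; both sides keep the full node set."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = round(test_size * len(shuffled))
    return (set(nodes), shuffled[:n_test]), (set(nodes), shuffled[n_test:])


# Toy network: a ring of eight nodes with eight edges.
nodes = list(range(8))
edges = [(i, (i + 1) % 8) for i in range(8)]

(test_n, test_e), (train_n, train_e) = node_split(nodes, edges)
(etest_n, etest_e), (etrain_n, etrain_e) = edge_split(nodes, edges)

# Node split: edges crossing the split are dropped, so the total shrinks.
print(len(test_e) + len(train_e), "of", len(edges), "edges kept")
# Edge split: every edge ends up in exactly one of the two networks.
print(len(etest_e) + len(etrain_e), "of", len(edges), "edges kept")
```

On the ring, any choice of two test nodes cuts at least two edges, so the node-based split always keeps fewer than eight edges, while the edge-based split keeps all of them.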

Alternatively, we can set the size of the training set with train_size:

test, train = pp.algorithms.evaluation.train_test_split(n, train_size=0.25, split='edge')
print(test)
print(train)
Uid:                        0x7f8c70080d30_test
Type:                       Network
Directed:           False
Multi-Edges:                False
Number of nodes:    100
Number of edges:    161
Uid:                        0x7f8c70080d30_train
Type:                       Network
Directed:           False
Multi-Edges:                False
Number of nodes:    100
Number of edges:    54

Apart from static networks, we can also create cross-validation sets for temporal networks. For this, we first load a temporal network from the KONECT database:

tn = pp.io.konect.read_konect_name('sociopatterns-hypertext')
print(tn)
tn.plot()
Uid:                        0x7f8c9139e280
Type:                       TemporalNetwork
Directed:           False
Multi-Edges:                True
Number of unique nodes:     113
Number of unique edges:     2196
Number of temp nodes:       113
Number of temp edges:       20818
Observation periode:        1246255220 - 1246467561.0

Network attributes
------------------
category:   HumanContact
code:       HY
name:       Hypertext 2009
description:        Visitor–visitor face-to-face contacts
extr:       sociopatterns
url:        http://www.sociopatterns.org/
long-description:   This is the network of face-to-face contacts of the attendees of the ACM Hypertext 2009 conference. The ACM Conference on Hypertext and Hypermedia 2009 (HT 2009, http://www.ht2009.org/) was held in Turin, Italy over three days from June 29 to July 1, 2009. In the network, a node represents a conference visitor, and an edge represents a face-to-face contact that was active for at least 20 seconds. Multiple edges denote multiple contacts. Each edge is annotated with the time at which the contact took place.
entity-names:       visitor
relationship-names: contact
cite:       konect:sociopatterns
time:       2009-06-29/2009-07-01
timeiso:    2009-06-29/2009-07-01

We can call the same function on a temporal network instance. By default, the split is based on the observed interactions: in the following example, the first 75 % of all time-stamped interactions are included in the training network, while the last 25 % are included in the test network.

test, train = pp.algorithms.evaluation.train_test_split(tn, test_size=0.25)
print(train)
print(test)
Uid:                        0x7f8c9139e280_train
Type:                       TemporalNetwork
Directed:           False
Multi-Edges:                True
Number of unique nodes:     112
Number of unique edges:     1854
Number of temp nodes:       112
Number of temp edges:       15614
Observation periode:        1246255220 - 1246441061.0
Uid:                        0x7f8c9139e280_test
Type:                       TemporalNetwork
Directed:           False
Multi-Edges:                True
Number of unique nodes:     95
Number of unique edges:     713
Number of temp nodes:       95
Number of temp edges:       5204
Observation periode:        1246441080 - 1246467561.0
train.plot()
test.plot()
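
The interaction-based split described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not pathpy's implementation: the helper split_by_interactions is hypothetical, events are represented as (source, target, time) tuples, and ties and rounding may be handled differently by the library.

```python
def split_by_interactions(events, test_size=0.25):
    """Put the last `test_size` share of time-ordered events into the test set."""
    ordered = sorted(events, key=lambda e: e[2])
    n_test = round(test_size * len(ordered))
    n_train = len(ordered) - n_test
    # Return (test, train) to mirror the order used in this tutorial.
    return ordered[n_train:], ordered[:n_train]


# Toy contact sequence: seven early contacts and one late contact.
events = [
    ("a", "b", 0), ("b", "c", 1), ("a", "c", 2), ("c", "d", 3),
    ("a", "d", 4), ("b", "d", 5), ("a", "b", 6), ("c", "d", 100),
]

test_events, train_events = split_by_interactions(events)
print(len(test_events), len(train_events))  # 2 6
```

Because the cut is placed by counting interactions, the two parts always have the requested relative sizes, regardless of how unevenly the contacts are spread over time.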

We can also split based on the observed time: here, all interactions occurring within the first 75 % of the observed time period are included in the training network, while the remaining interactions are included in the test network.

test, train = pp.algorithms.evaluation.train_test_split(tn, test_size=0.25, split='time')
print(train)
print(test)
Uid:                        0x7f8c9139e280_train
Type:                       TemporalNetwork
Directed:           False
Multi-Edges:                True
Number of unique nodes:     113
Number of unique edges:     2196
Number of temp nodes:       113
Number of temp edges:       20815
Observation periode:        1246255220 - 1246467541.0
Uid:                        0x7f8c9139e280_test
Type:                       TemporalNetwork
Directed:           False
Multi-Edges:                True
Number of unique nodes:     5
Number of unique edges:     3
Number of temp nodes:       5
Number of temp edges:       3
Observation periode:        1246467560 - 1246467561.0
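
The time-based split described above can likewise be sketched in plain Python. Again, this is only an illustration under stated assumptions and not pathpy's implementation: the helper split_by_time is hypothetical, events are (source, target, time) tuples, and the cutoff is taken as a fraction of the span between the first and the last timestamp.

```python
def split_by_time(events, test_size=0.25):
    """Put events after the first (1 - test_size) share of the time span into test."""
    times = [t for _, _, t in events]
    start, end = min(times), max(times)
    cutoff = start + (1 - test_size) * (end - start)
    train = [e for e in events if e[2] <= cutoff]
    test = [e for e in events if e[2] > cutoff]
    # Return (test, train) to mirror the order used in this tutorial.
    return test, train


# Toy contact sequence: seven early contacts and one late contact.
events = [
    ("a", "b", 0), ("b", "c", 1), ("a", "c", 2), ("c", "d", 3),
    ("a", "d", 4), ("b", "d", 5), ("a", "b", 6), ("c", "d", 100),
]

test_events, train_events = split_by_time(events)
print(len(test_events), len(train_events))  # 1 7
```

With bursty data such as this toy sequence, a cut placed on the time axis can produce parts of very different sizes, which is the main practical difference from a cut placed by counting interactions.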